1. Introduction
This project aims to build a model that predicts home sale prices in Nashville, Tennessee using Ordinary Least Squares (OLS) regression. Home price predictions are often inaccurate because prediction models fail to account for local variation in home pricing, i.e., variation attributable to factors specific to a given neighborhood. This matters because incorrect predictions can leave homes either overpriced (they sit on the market unsold) or underpriced (the seller gets less than the home is actually worth).
The overall strategy for building the prediction model was to reduce the model's error (the difference between observed and predicted values) without sacrificing its generalizability. In other words, the focus was on a model that helps explain the variability of the observed sale prices without being overfit to the particular observations (dataset) being analyzed. The error of the final model was nevertheless quite high (MAPE > 0.50). The R-squared of the model was 0.55, i.e., about 55% of the variance in the dependent variable is explained by the model.
2. Data
2.1 Data sources
The primary dataset used in this prediction model was the Nashville home sale price dataset. To supplement the analysis, additional datasets on parks, public schools, and crime were obtained from the Nashville Open Data website. Demographic estimates relating to median income, the population receiving food stamps, and race were downloaded from the U.S. Census API using the tidycensus library in R.
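A minimal sketch of pulling such census estimates with tidycensus follows; the ACS variable codes, year, and geography shown here are illustrative assumptions (the exact codes used in the report are not reproduced), and a Census API key is required:

```r
library(tidycensus)

# census_api_key("YOUR_KEY")  # one-time setup; key from api.census.gov

# Illustrative ACS 5-year variables: median household income and total population
acs_vars <- c(median_income = "B19013_001", total_pop = "B01003_001")

tracts <- get_acs(geography = "tract",
                  variables = acs_vars,
                  state = "TN",
                  county = "Davidson",  # Nashville-Davidson County
                  year = 2017,
                  output = "wide")
```

The `output = "wide"` form returns one row per tract, which makes it easy to join the estimates onto home sale records by tract GEOID.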
2.2 Data cleaning
Extraneous variables, such as those containing only ‘NA’ (not applicable) values, and observations with missing values were removed from the Nashville home sale price dataset. Some observations in the dataset related to sales of vacant residential and commercial land; these were removed from the main dataset and handled separately. Homes (i.e., buildings) are qualitatively different from vacant land (no buildings): variables such as building type or building height may be statistically significant and useful for predicting home sale prices but are irrelevant for predicting prices of vacant land, so leaving such observations in the dataset could introduce errors into the home price model. Note: for the competition, vacant land prices were predicted separately; the analysis in this report focuses exclusively on home prices.
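The cleaning steps above can be sketched as follows; the column names and vacant-land label used here are illustrative stand-ins, since the dataset's exact field names are not reproduced in this report:

```r
library(dplyr)

# Toy stand-in for the raw sales data
raw_sales <- data.frame(
  SalePrice = c(250000, 310000, NA, 95000),
  LandUseFullDescription = c("SINGLE FAMILY", "DUPLEX",
                             "SINGLE FAMILY", "VACANT RESIDENTIAL LAND"),
  all_na_col = NA  # stands in for a column containing only NA values
)

clean_sales <- raw_sales %>%
  select(where(~ !all(is.na(.x)))) %>%              # drop all-NA columns
  na.omit() %>%                                     # drop rows with missing values
  filter(!grepl("VACANT", LandUseFullDescription))  # set vacant land aside
```

The vacant-land rows can be kept in a separate data frame (filtering on `grepl("VACANT", ...)`) and modeled on their own, as was done for the competition.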
2.3 Exploratory data analysis
As part of the exploratory data analysis, relationships between the dependent variable and the different independent variables were examined. Correlations between continuous variables were also examined, and chi-square tests were performed to determine whether significant associations existed between the categorical variables (those with two or more levels) in the dataset.
Variable Descriptions
3. Methods
3.1 Feature engineering
A variety of internal house characteristics were included as categorical independent variables in the prediction model. Parks and public schools often act as indicators of public services/amenities in a neighborhood, so distance to the nearest park and to the nearest public school were computed and used as independent variables. Some studies have found that burglaries and aggravated assaults have a negative impact on home sale prices ([Homes.com](https://www.homes.com/blog/2016/05/secure-new-home-research-crime-rates-impact-home-value/)); the average distance to the five nearest burglary and aggravated assault incidents was therefore computed and included as an independent variable. The average sale price of the five nearest homes was also computed, to help account for the considerable clustering seen in the distribution of home sale prices. Percentage values were computed for the census estimates relating to median household income, race (White and African-American), poverty (based on food stamp receipt), and college education, and these were included as independent variables as well. Finally, several new variables were derived from the home sale price dataset itself. First, the parcel ID numbers were parsed and the street numbers extracted. Next, price per square foot was computed by dividing the sale price by the gross square footage of each home. The average price per square foot for each house type (single-family home, duplex, residential condo, etc.) was then computed for each street parcel number, and likewise for each house type within each census tract.
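The k-nearest-neighbor features described above (average distance to the five nearest crime incidents, average price of the five nearest sales) can be sketched in base R. The function and toy coordinates below are illustrative, and distances are Euclidean, which assumes projected (planar) coordinates rather than raw longitude/latitude:

```r
# Average distance from each point in `from` to its k nearest points in `to`.
# `from` and `to` are two-column matrices of (x, y) coordinates.
avg_knn_dist <- function(from, to, k = 5) {
  apply(from, 1, function(pt) {
    d <- sqrt((to[, 1] - pt[1])^2 + (to[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])   # mean of the k smallest distances
  })
}

# Toy example: 3 homes, 6 crime incidents, k = 2
homes  <- cbind(c(0, 5, 10), c(0, 0, 0))
crimes <- cbind(c(1, 2, 4, 6, 9, 11), c(0, 0, 0, 0, 0, 0))
avg_knn_dist(homes, crimes, k = 2)  # 1.5 1.0 1.0
```

The same helper, applied to the coordinates and prices of nearby sales instead of crime incidents, yields the average price of the k nearest homes (a spatial-lag-style feature).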
3.2 Regression: Prediction Model
The Ordinary Least Squares regression method was used to build the prediction model. R-squared, mean absolute error (MAE), and mean absolute percentage error (MAPE) were computed to assess goodness of fit, i.e., how well the predicted values match the observed values. K-fold cross-validation and spatial cross-validation were additionally performed, and the resulting MAPEs were examined to determine whether the model was robust enough to generalize to unseen data.
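The error metrics and the k-fold procedure can be sketched as follows, using caret with a built-in dataset as a stand-in for the project's `train_df` (the report itself used 100 folds):

```r
library(caret)

# MAE and MAPE from observed and predicted values
mae  <- function(obs, pred) mean(abs(pred - obs))
mape <- function(obs, pred) mean(abs(pred - obs) / obs)

# k-fold cross-validation of an OLS model; mtcars stands in for train_df
set.seed(1)
fit_cv <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
                trControl = trainControl(method = "cv", number = 5))
fit_cv$resample   # per-fold RMSE, Rsquared, and MAE
```

Inspecting the spread of the per-fold metrics (rather than only their mean) is what indicates whether the model's fit is stable across subsets of the data.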
4. Results
| Metric | Test Set |
|---|---|
| MAE | 102588.6345334 |
| MAPE | 0.4857880 |
| R-squared | 0.5682351 |
Cross-validation results (100 folds)
## RMSE Rsquared MAE Resample
## 1 198536.69 0.3092568 109655.02 Fold001
## 2 235317.68 0.4171288 132414.44 Fold002
## 3 145619.40 0.6145433 100219.16 Fold003
## 4 121160.43 0.6525693 87105.09 Fold004
## 5 149648.14 0.6498089 102395.21 Fold005
## 6 290468.25 0.6211593 108398.56 Fold006
## 7 163498.90 0.6459487 78096.17 Fold007
## 8 137812.85 0.5418373 92134.83 Fold008
## 9 161237.49 0.5718516 95373.97 Fold009
## 10 132467.41 0.6082223 80602.74 Fold010
## 11 89474.54 0.8427945 65145.96 Fold011
## 12 133686.50 0.5984117 91405.51 Fold012
## 13 195383.17 0.6395215 112485.14 Fold013
## 14 167325.11 0.5973103 100398.18 Fold014
## 15 115511.98 0.6478628 76820.63 Fold015
## 16 141217.86 0.6166980 91604.42 Fold016
## 17 148023.25 0.6007216 92283.78 Fold017
## 18 141322.25 0.7390135 95977.14 Fold018
## 19 640297.21 0.3966016 158626.77 Fold019
## 20 177470.79 0.4356180 116408.00 Fold020
## 21 611879.19 0.4864402 163519.09 Fold021
## 22 174865.58 0.5672823 97195.80 Fold022
## 23 270121.36 0.3134974 134115.49 Fold023
## 24 175609.50 0.6937315 98537.45 Fold024
## 25 174220.18 0.6705874 96725.21 Fold025
## 26 170189.04 0.5011325 104849.43 Fold026
## 27 251576.52 0.2441147 119079.30 Fold027
## 28 149066.29 0.6149954 96432.67 Fold028
## 29 148013.56 0.5912873 105485.54 Fold029
## 30 191345.78 0.5880513 100462.22 Fold030
## 31 130027.75 0.5144246 87516.72 Fold031
## 32 285474.90 0.3916367 120274.49 Fold032
## 33 231629.78 0.7176000 113749.13 Fold033
## 34 189349.82 0.6204611 95803.18 Fold034
## 35 274688.60 0.3021963 125056.75 Fold035
## 36 127232.91 0.5986963 88069.79 Fold036
## 37 296510.98 0.4635535 142861.38 Fold037
## 38 145694.56 0.6912406 102153.80 Fold038
## 39 241166.41 0.4830084 120946.75 Fold039
## 40 153959.55 0.5776906 98626.50 Fold040
## 41 158663.26 0.5512602 94522.17 Fold041
## 42 123305.15 0.7661119 82126.85 Fold042
## 43 164827.19 0.4311128 94939.57 Fold043
## 44 135942.30 0.6637912 93074.03 Fold044
## 45 150722.51 0.6804490 98507.22 Fold045
## 46 152263.88 0.5924613 98349.35 Fold046
## 47 494720.08 0.2171191 156085.00 Fold047
## 48 163701.77 0.5395329 87567.38 Fold048
## 49 319098.82 0.4389980 131518.32 Fold049
## 50 335873.63 0.4447464 118203.51 Fold050
## 51 157229.82 0.6411369 92758.92 Fold051
## 52 117326.15 0.6899767 79672.51 Fold052
## 53 285538.91 0.1970974 130871.66 Fold053
## 54 142100.51 0.6716214 92450.36 Fold054
## 55 200198.73 0.5991658 97941.36 Fold055
## 56 130367.08 0.6691201 90125.96 Fold056
## 57 159720.41 0.5675676 102235.05 Fold057
## 58 139049.12 0.5835107 86938.39 Fold058
## 59 185991.89 0.6924791 98690.39 Fold059
## 60 248601.52 0.5137792 142547.07 Fold060
## 61 117264.24 0.6406352 85965.06 Fold061
## 62 218863.02 0.5771870 103757.61 Fold062
## 63 123198.59 0.6877299 86381.42 Fold063
## 64 169613.22 0.4308258 102601.90 Fold064
## 65 244984.51 0.5948862 112769.72 Fold065
## 66 217326.54 0.5937683 121605.93 Fold066
## 67 488811.85 0.7306476 148490.06 Fold067
## 68 220386.23 0.6443428 113960.22 Fold068
## 69 139228.71 0.5470259 89333.54 Fold069
## 70 163845.18 0.6573281 94241.59 Fold070
## 71 236320.88 0.5069846 102099.98 Fold071
## 72 212296.47 0.4767386 115464.55 Fold072
## 73 481792.07 0.8054947 169721.62 Fold073
## 74 158907.68 0.5657460 99493.83 Fold074
## 75 89854.07 0.7983547 66010.36 Fold075
## 76 148360.89 0.5680513 97866.50 Fold076
## 77 129714.91 0.6678778 82324.17 Fold077
## 78 151786.98 0.6750157 105603.99 Fold078
## 79 192522.40 0.4340778 115121.50 Fold079
## 80 254393.19 0.5978394 114723.32 Fold080
## 81 171420.63 0.6655408 101738.76 Fold081
## 82 134230.12 0.6655717 89200.32 Fold082
## 83 166241.73 0.6250796 102876.50 Fold083
## 84 277458.29 0.2799600 111173.89 Fold084
## 85 151476.07 0.7290298 99459.12 Fold085
## 86 264295.23 0.3453297 117023.20 Fold086
## 87 225545.47 0.5344037 123992.89 Fold087
## 88 122275.31 0.6259297 86641.81 Fold088
## 89 123928.98 0.5999291 84328.00 Fold089
## 90 187611.97 0.7529063 104147.58 Fold090
## 91 185156.94 0.6039770 108452.75 Fold091
## 92 210267.49 0.5523825 130936.42 Fold092
## 93 143791.04 0.6576353 99708.46 Fold093
## 94 208990.91 0.5755298 95522.42 Fold094
## 95 151602.70 0.6188139 97574.39 Fold095
## 96 97409.00 0.8891876 79476.11 Fold096
## 97 278653.87 0.3939344 119413.36 Fold097
## 98 165047.91 0.5100955 98625.95 Fold098
## 99 177424.56 0.4764152 116079.67 Fold099
## 100 242130.04 0.3889546 118671.38 Fold100
## [1] 19782.94
Moran’s I Test and the Residual Map of the Test Set
## Warning in knearneigh(coords, 4): knearneigh: identical points found
##
## Moran I test under randomisation
##
## data: residualsToMap$residual
## weights: nb2listw(spatialWeights, style = "W")
##
## Moran I statistic standard deviate = -1.5991, p-value = 0.9451
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.0233051199 -0.0004755112 0.0002038129
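The Moran's I test printed above can be reproduced with spdep along the following lines. This is a sketch with randomly generated stand-in data; in the report, `residualsToMap$residual` and the sale-point coordinates play the roles of `residual` and `coords`:

```r
library(spdep)

# Toy stand-in for the test-set residuals and their coordinates
set.seed(1)
coords   <- cbind(runif(100), runif(100))
residual <- rnorm(100)

# 4-nearest-neighbor spatial weights, row-standardized ("W"),
# matching knearneigh(coords, 4) and nb2listw(..., style = "W") above
spatialWeights <- knn2nb(knearneigh(coords, 4))
moran_res <- moran.test(residual, nb2listw(spatialWeights, style = "W"))
moran_res
```

A Moran's I statistic near its expectation (as in the output above, with a large p-value) suggests the model's residuals are not spatially autocorrelated, i.e., the remaining error is not spatially clustered.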
## Reading layer `geo_export_c67bfcab-09af-4375-b42b-3631cbd3a654' from data source `/Users/zhang/Downloads/midtermData_507/Zip_Codes/geo_export_c67bfcab-09af-4375-b42b-3631cbd3a654.shp' using driver `ESRI Shapefile'
## Simple feature collection with 51 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -87.05465 ymin: 35.96777 xmax: -86.51157 ymax: 36.4057
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
Spatial cross-validation
# Rich zip (37203): train on all other zips, predict for 37203
rich_zip <- train_df %>% filter(LocationZip %in% c("37203"))
'%notin%' <- Negate('%in%')
other_zip <- train_df %>% filter(LocationZip %notin% c("37203"))
richreg <- lm(SalePrice ~ LandUseFullDescription + bedroomsunits_building + baths +
                halfbaths + roomsunits_building + as.factor(st_ht) +
                Averge_Sal + median_income + pct_white + pct_black + pct_poverty +
                pct_bachelor + crime_distance + parks_distance + schools_distance,
              data = other_zip)
richregPred <- predict(richreg, rich_zip)
richregPredValues <-
data.frame(observedSales = rich_zip$SalePrice,
predictedSales = richregPred)
richregPredValues <-
richregPredValues %>%
mutate(error = predictedSales - observedSales) %>%
mutate(absError = abs(predictedSales - observedSales)) %>%
mutate(percentAbsError = abs(predictedSales - observedSales) / observedSales)
r1 <- mean(richregPredValues$absError)
r2 <- mean(richregPredValues$percentAbsError)
# Middle-income zip (37206): train on all other zips, predict for 37206
middle_zip <- train_df %>% filter(LocationZip %in% c("37206"))
'%notin%' <- Negate('%in%')
othermiddle_zip <- train_df %>% filter(LocationZip %notin% c("37206"))
middlereg <- lm(SalePrice ~ LandUseFullDescription + bedroomsunits_building + baths +
                  halfbaths + roomsunits_building + as.factor(st_ht) +
                  Averge_Sal + median_income + pct_white + pct_black + pct_poverty +
                  pct_bachelor + crime_distance + parks_distance + schools_distance,
                data = othermiddle_zip)
middleregPred <- predict(middlereg, middle_zip)
middleregPredValues <-
data.frame(observedSales = middle_zip$SalePrice,
predictedSales = middleregPred)
middleregPredValues <-
middleregPredValues %>%
mutate(error = predictedSales - observedSales) %>%
mutate(absError = abs(predictedSales - observedSales)) %>%
mutate(percentAbsError = abs(predictedSales - observedSales) / observedSales)
m1 <- mean(middleregPredValues$absError)
m2 <- mean(middleregPredValues$percentAbsError)
# Low-income zip (37115): train on all other zips, predict for 37115
poor_zip <- train_df %>% filter(LocationZip %in% c("37115"))
'%notin%' <- Negate('%in%')
otherpoor_zip <- train_df %>% filter(LocationZip %notin% c("37115"))
poorreg <- lm(SalePrice ~ LandUseFullDescription + bedroomsunits_building + baths +
                halfbaths + roomsunits_building + as.factor(st_ht) +
                Averge_Sal + median_income + pct_white + pct_black + pct_poverty +
                pct_bachelor + crime_distance + parks_distance + schools_distance,
              data = otherpoor_zip)
poorregPred <- predict(poorreg, poor_zip)
poorregPredValues <-
data.frame(observedSales = poor_zip$SalePrice,
predictedSales = poorregPred)
poorregPredValues <-
poorregPredValues %>%
mutate(error = predictedSales - observedSales) %>%
mutate(absError = abs(predictedSales - observedSales)) %>%
mutate(percentAbsError = abs(predictedSales - observedSales) / observedSales)
p1 <- mean(poorregPredValues$absError)
p2 <- mean(poorregPredValues$percentAbsError)
df1<- as.data.frame(matrix(c(r1,r2), nrow = 2, ncol = 1))
df2<- as.data.frame(matrix(c(m1,m2), nrow = 2, ncol = 1))
df3<- as.data.frame(matrix(c(p1,p2), nrow = 2, ncol = 1))
out <- cbind(df1,df2,df3)
colnames(out) <- c("rich","middle","poor")
rownames(out) <- c("MAE","MAPE")
out
## rich middle poor
## MAE 219659.8800476 97972.3920910 46664.0650597
## MAPE 0.5487434 0.4049776 0.4878722
# Observed vs. predicted scatterplots for each neighborhood group
ggplot(data=richregPredValues,aes(x=observedSales,y=predictedSales)) +
geom_point(colour='red') +
geom_smooth(method='lm') + labs(title="Observed and Predicted Sale Price \nof Rich Neighborhood", x="Observed",
y="Predicted") + theme(plot.title = element_text(hjust = 0.5))
ggplot(data=middleregPredValues,aes(x=observedSales,y=predictedSales)) +
geom_point(colour='red') +
geom_smooth(method='lm') + labs(title="Observed and Predicted Sale Price \nof Middle Neighborhood", x="Observed",
y="Predicted") + theme(plot.title = element_text(hjust = 0.5))
ggplot(data=poorregPredValues,aes(x=observedSales,y=predictedSales)) +
geom_point(colour='red') +
geom_smooth(method='lm') + labs(title="Observed and Predicted Sale Price \nof Poor Neighborhood", x="Observed",
y="Predicted") + theme(plot.title = element_text(hjust = 0.5))
5. Discussion
As the results show, the R-squared of the regression model is around 0.55 (0.57 on the test set) and the MAPE is quite high. The high MAPE indicates that the model is not robust enough to generalize to unseen data and may not be an effective prediction model. One of the most important variables in this model was the average sale price of the five nearest homes; introducing it considerably improved the R-squared. The internal characteristics of homes, especially story height and building type, were also important variables: they were statistically significant and increased the R-squared of the regression. Surprisingly, the price per square foot for each building type by parcel street number (as extracted from the parcel ID) and the average distance to crime incidents were not very useful predictors.
As the spatial cross-validation results show, the model does reasonably well when predicting home sale prices for low-income neighborhoods, but it does not perform well when tested on data spanning neighborhoods of all income levels. The map of home sale prices in Nashville clearly shows considerable spatial clustering, and the main reason the model underperforms is that these local variations in sale prices were not well captured by any of the independent variables. Many of the census variables (median household income, for instance) show spatial patterns similar to home sale prices; however, because the census information is aggregated at the census tract level, these variables introduce a degree of scale bias, and their usefulness in explaining local variation diminishes. Furthermore, many of the internal house characteristics had zero values for residential condos in the home sale price dataset, and none of the other variables in the model helped account for the internal characteristics of condos; this may also have contributed to the high MAPE.
6. Conclusion
We would not recommend this prediction model to Zillow in its current form, since the model needs improvement. Additional independent variables that help explain the spatial clustering in home sale prices need to be included in the model; this would improve both its generalizability and its predictive power on unseen data.